library(gdata)
library(ggplot2)
library(dplyr)
library(tidyr)
library(corrplot)
library(kableExtra)
library(tidyverse)
library(stringr)
library(lubridate)
library(scales)
library(graphics)
library(caret)Multiple Linear Regression
Ames Housing Data
1 Setting up the environment
2 Introduction
2.1 Background
The Ames Housing dataset, an improvement over the well-known Boston Housing dataset, has become an essential tool for budding data scientists and researchers focusing on regression problems. Comprising 2930 observations and many explanatory variables ranging from the quality of streets to the number of fireplaces, it paints a detailed picture of the residential homes in Ames, Iowa.
2.2 Objective
This analysis aims to construct a predictive model that, using the various features of a home, accurately predicts its sale price. Such a model has practical applications for various stakeholders in the real estate market.
2.3 Importance
With a robust prediction model, potential homeowners can gauge if a listing is over or underpriced, sellers can ensure they’re placing a competitive price, and real estate professionals can provide evidence-backed advice to clients.
2.4 Methodology Overview
Our approach is systematic:
- Data Exploration - Getting familiar with the data’s structure.
- Data Preprocessing - Preparing data for modeling.
- Feature Engineering - Optimizing representation of the predictors.
- Feature Selection - Using forward selection, backward elimination, and best subset selection.
- Model Selection & Training - Choosing an algorithm and training it.
- Model Evaluation - Validating the model’s accuracy.
- Prediction & Conclusion - Making predictions and concluding the analysis.
3 Data Exploration in Ames Housing Dataset
3.1 Loading the Data
This section focuses on loading the necessary libraries and the dataset for our exploration.
# Loading the required libraries
library(tidyverse)
library(corrplot)# Loading the Ames Housing dataset
ames_data <- read.csv("../data/AmesHousing2.csv")
ggplot(data = ames_data, aes(x = Gr.Liv.Area, y = SalePrice)) +
geom_point(color='#336699') +
labs(x = "Ground Living Area", y = "Sale Price") +
ggtitle("Scatterplot of Sale Price vs. Gr.Liv.Area")# Loading the Ames Housing dataset
filtered_ames_data <- ames_data[ames_data$Gr.Liv.Area <= 4000, ]
ggplot(data = filtered_ames_data, aes(x = Gr.Liv.Area, y = SalePrice)) +
geom_point(color='#336699') +
labs(x = "Ground Living Area", y = "Sale Price") +
ggtitle("Scatterplot of Sale Price vs. Gr.Liv.Area")3.2 Basic Overview
Before diving deeper into the data, it’s essential to get an overall sense of its structure, summary statistics, and a peek into the first few rows.
3.2.1 View the first few rows of the dataset
library(knitr)
library(kableExtra)
# Create a scrollable kable
kable(ames_data[1:10, ], "html") %>%
kable_styling("striped", full_width = T, position = "center") | Order | Lot.Frontage | Lot.Area | Lot.Shape | Utilities | Lot.Config | Land.Slope | Neighborhood | Bldg.Type | House.Style | Overall.Qual | Overall.Cond | Year.Built | Year.Remod.Add | Foundation | Bsmt.Unf.SF | Total.Bsmt.SF | BaseLivArea | Central.Air | X1st.Flr.SF | X2nd.Flr.SF | Gr.Liv.Area | Full.Bath | Half.Bath | Bathrooms | Bedroom.AbvGr | Kitchen.AbvGr | Kitchen.Qual | TotRms.AbvGrd | Fireplaces | Garage.Type | Garage.Area | Wood.Deck.SF | Open.Porch.SF | Enclosed.Porch | X3Ssn.Porch | Screen.Porch | Mo.Sold | Yr.Sold | Sale.Condition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 117 | 80 | 9600 | Reg | AllPub | Inside | Gtl | NWAmes | 1Fam | 2Story | 6 | 6 | 1971 | 1971 | CBlock | 386 | 715 | 329 | Y | 930 | 715 | 1645 | 1 | 2 | 2 | 4 | 1 | TA | 7 | 0 | Attchd | 441 | 0 | 78 | 0 | 0 | 0 | 6 | 2010 | Normal | 171000 |
| 325 | 94 | 9400 | Reg | AllPub | Corner | Gtl | Mitchel | Duplex | 2Story | 6 | 5 | 1971 | 1971 | CBlock | 912 | 912 | 0 | Y | 912 | 912 | 1824 | 2 | 2 | 3 | 4 | 2 | TA | 8 | 0 | NA | 0 | 128 | 0 | 0 | 0 | 0 | 4 | 2010 | Normal | 139000 |
| 337 | 70 | 7700 | Reg | AllPub | Inside | Gtl | Mitchel | Duplex | 2Story | 5 | 2 | 1985 | 1986 | PConc | 1216 | 1216 | 0 | Y | 1216 | 1216 | 2432 | 4 | 2 | 5 | 4 | 2 | TA | 10 | 0 | Attchd | 616 | 200 | 0 | 0 | 0 | 0 | 2 | 2010 | Normal | 159000 |
| 393 | 60 | 9000 | Reg | AllPub | FR2 | Gtl | NAmes | Duplex | 2Story | 5 | 5 | 1974 | 1974 | CBlock | 896 | 896 | 0 | Y | 896 | 896 | 1792 | 2 | 2 | 3 | 4 | 2 | TA | 8 | 0 | NA | 0 | 32 | 45 | 0 | 0 | 0 | 9 | 2009 | Normal | 136000 |
| 590 | NA | 13774 | IR1 | AllPub | Inside | Gtl | NWAmes | 1Fam | 2Story | 7 | 7 | 1977 | 1992 | PConc | 476 | 908 | 432 | Y | 1316 | 972 | 2288 | 1 | 2 | 2 | 4 | 1 | Gd | 8 | 2 | Attchd | 520 | 321 | 72 | 0 | 0 | 156 | 11 | 2009 | Normal | 230000 |
| 667 | 81 | 9671 | Reg | AllPub | Corner | Gtl | NAmes | Duplex | 2Story | 6 | 5 | 1969 | 1969 | CBlock | 1248 | 1248 | 0 | Y | 1248 | 1296 | 2544 | 2 | 2 | 3 | 6 | 2 | TA | 12 | 0 | Attchd | 907 | 0 | 0 | 0 | 0 | 0 | 8 | 2009 | Normal | 190000 |
| 670 | 91 | 11643 | Reg | AllPub | Inside | Gtl | NAmes | Duplex | 2Story | 5 | 5 | 1969 | 1969 | CBlock | 748 | 1248 | 500 | Y | 1338 | 1296 | 2634 | 2 | 2 | 3 | 6 | 2 | TA | 12 | 0 | Detchd | 968 | 0 | 0 | 0 | 0 | 0 | 8 | 2009 | Normal | 200000 |
| 809 | 64 | 7018 | Reg | AllPub | Inside | Gtl | SawyerW | Duplex | SFoyer | 5 | 5 | 1979 | 1979 | CBlock | 0 | 1086 | 1086 | Y | 1224 | 0 | 1224 | 0 | 2 | 1 | 2 | 2 | TA | 6 | 2 | Detchd | 528 | 120 | 0 | 0 | 0 | 0 | 6 | 2009 | Alloca | 153337 |
| 814 | 64 | 7040 | Reg | AllPub | Inside | Gtl | SawyerW | Duplex | SFoyer | 5 | 5 | 1979 | 1979 | CBlock | 0 | 1094 | 1094 | Y | 1229 | 0 | 1229 | 0 | 2 | 1 | 2 | 2 | Gd | 6 | 2 | Detchd | 672 | 120 | 0 | 0 | 0 | 0 | 6 | 2009 | Alloca | 148325 |
| 816 | NA | 11855 | Reg | AllPub | Inside | Gtl | SawyerW | Duplex | 2Story | 7 | 5 | 2000 | 2000 | PConc | 348 | 1168 | 820 | Y | 1168 | 1619 | 2787 | 4 | 2 | 5 | 6 | 2 | TA | 8 | 2 | BuiltIn | 820 | 312 | 0 | 0 | 0 | 0 | 10 | 2009 | Normal | 269500 |
# scroll_box(width = "750px", height = "300px")3.2.2 Display the structure of the dataset
# str(ames_data)
library(skimr)
library(gt)
ames_data |>
skimr::skim() |>
gt::gt()| skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| character | Lot.Shape | 0 | 1.0000000 | 3 | 3 | 0 | 4 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
| character | Utilities | 0 | 1.0000000 | 6 | 6 | 0 | 3 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
| character | Lot.Config | 0 | 1.0000000 | 3 | 7 | 0 | 5 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
| character | Land.Slope | 0 | 1.0000000 | 3 | 3 | 0 | 3 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
| character | Neighborhood | 0 | 1.0000000 | 5 | 7 | 0 | 28 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
| character | Bldg.Type | 0 | 1.0000000 | 4 | 6 | 0 | 5 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
| character | House.Style | 0 | 1.0000000 | 4 | 6 | 0 | 8 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
| character | Foundation | 0 | 1.0000000 | 4 | 6 | 0 | 6 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
| character | Central.Air | 0 | 1.0000000 | 1 | 1 | 0 | 2 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
| character | Kitchen.Qual | 0 | 1.0000000 | 2 | 2 | 0 | 5 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
| character | Garage.Type | 157 | 0.9464164 | 6 | 7 | 0 | 6 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
| character | Sale.Condition | 0 | 1.0000000 | 6 | 7 | 0 | 6 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
| numeric | Order | 0 | 1.0000000 | NA | NA | NA | NA | NA | 1.465500e+03 | 8.459625e+02 | 1 | 733.25 | 1465.5 | 2197.75 | 2930 | ▇▇▇▇▇ |
| numeric | Lot.Frontage | 490 | 0.8327645 | NA | NA | NA | NA | NA | 6.922459e+01 | 2.336533e+01 | 21 | 58.00 | 68.0 | 80.00 | 313 | ▇▃▁▁▁ |
| numeric | Lot.Area | 0 | 1.0000000 | NA | NA | NA | NA | NA | 1.014792e+04 | 7.880018e+03 | 1300 | 7440.25 | 9436.5 | 11555.25 | 215245 | ▇▁▁▁▁ |
| numeric | Overall.Qual | 0 | 1.0000000 | NA | NA | NA | NA | NA | 6.094881e+00 | 1.411026e+00 | 1 | 5.00 | 6.0 | 7.00 | 10 | ▁▂▇▅▁ |
| numeric | Overall.Cond | 0 | 1.0000000 | NA | NA | NA | NA | NA | 5.563140e+00 | 1.111537e+00 | 1 | 5.00 | 5.0 | 6.00 | 9 | ▁▁▇▅▁ |
| numeric | Year.Built | 0 | 1.0000000 | NA | NA | NA | NA | NA | 1.971356e+03 | 3.024536e+01 | 1872 | 1954.00 | 1973.0 | 2001.00 | 2010 | ▁▂▃▆▇ |
| numeric | Year.Remod.Add | 0 | 1.0000000 | NA | NA | NA | NA | NA | 1.984267e+03 | 2.086029e+01 | 1950 | 1965.00 | 1993.0 | 2004.00 | 2010 | ▅▂▂▃▇ |
| numeric | Bsmt.Unf.SF | 1 | 0.9996587 | NA | NA | NA | NA | NA | 5.592625e+02 | 4.394942e+02 | 0 | 219.00 | 466.0 | 802.00 | 2336 | ▇▅▂▁▁ |
| numeric | Total.Bsmt.SF | 1 | 0.9996587 | NA | NA | NA | NA | NA | 1.051615e+03 | 4.406151e+02 | 0 | 793.00 | 990.0 | 1302.00 | 6110 | ▇▃▁▁▁ |
| numeric | BaseLivArea | 0 | 1.0000000 | NA | NA | NA | NA | NA | 4.921840e+02 | 4.773283e+02 | 0 | 0.00 | 459.5 | 808.00 | 5644 | ▇▁▁▁▁ |
| numeric | X1st.Flr.SF | 0 | 1.0000000 | NA | NA | NA | NA | NA | 1.159558e+03 | 3.918909e+02 | 334 | 876.25 | 1084.0 | 1384.00 | 5095 | ▇▃▁▁▁ |
| numeric | X2nd.Flr.SF | 0 | 1.0000000 | NA | NA | NA | NA | NA | 3.354560e+02 | 4.283957e+02 | 0 | 0.00 | 0.0 | 703.75 | 2065 | ▇▃▂▁▁ |
| numeric | Gr.Liv.Area | 0 | 1.0000000 | NA | NA | NA | NA | NA | 1.499690e+03 | 5.055089e+02 | 334 | 1126.00 | 1442.0 | 1742.75 | 5642 | ▇▇▁▁▁ |
| numeric | Full.Bath | 0 | 1.0000000 | NA | NA | NA | NA | NA | 1.566553e+00 | 5.529406e-01 | 0 | 1.00 | 2.0 | 2.00 | 4 | ▁▇▇▁▁ |
| numeric | Half.Bath | 0 | 1.0000000 | NA | NA | NA | NA | NA | 3.795222e-01 | 5.026293e-01 | 0 | 0.00 | 0.0 | 1.00 | 2 | ▇▁▅▁▁ |
| numeric | Bathrooms | 0 | 1.0000000 | NA | NA | NA | NA | NA | 1.756314e+00 | 6.428715e-01 | 0 | 1.00 | 2.0 | 2.50 | 5 | ▆▇▅▁▁ |
| numeric | Bedroom.AbvGr | 0 | 1.0000000 | NA | NA | NA | NA | NA | 2.854266e+00 | 8.277311e-01 | 0 | 2.00 | 3.0 | 3.00 | 8 | ▁▇▂▁▁ |
| numeric | Kitchen.AbvGr | 0 | 1.0000000 | NA | NA | NA | NA | NA | 1.044369e+00 | 2.140762e-01 | 0 | 1.00 | 1.0 | 1.00 | 3 | ▁▇▁▁▁ |
| numeric | TotRms.AbvGrd | 0 | 1.0000000 | NA | NA | NA | NA | NA | 6.443003e+00 | 1.572964e+00 | 2 | 5.00 | 6.0 | 7.00 | 15 | ▁▇▂▁▁ |
| numeric | Fireplaces | 0 | 1.0000000 | NA | NA | NA | NA | NA | 5.993174e-01 | 6.479209e-01 | 0 | 0.00 | 1.0 | 1.00 | 4 | ▇▇▁▁▁ |
| numeric | Garage.Area | 1 | 0.9996587 | NA | NA | NA | NA | NA | 4.728197e+02 | 2.150465e+02 | 0 | 320.00 | 480.0 | 576.00 | 1488 | ▃▇▃▁▁ |
| numeric | Wood.Deck.SF | 0 | 1.0000000 | NA | NA | NA | NA | NA | 9.375188e+01 | 1.263616e+02 | 0 | 0.00 | 0.0 | 168.00 | 1424 | ▇▁▁▁▁ |
| numeric | Open.Porch.SF | 0 | 1.0000000 | NA | NA | NA | NA | NA | 4.753345e+01 | 6.748340e+01 | 0 | 0.00 | 27.0 | 70.00 | 742 | ▇▁▁▁▁ |
| numeric | Enclosed.Porch | 0 | 1.0000000 | NA | NA | NA | NA | NA | 2.301160e+01 | 6.413906e+01 | 0 | 0.00 | 0.0 | 0.00 | 1012 | ▇▁▁▁▁ |
| numeric | X3Ssn.Porch | 0 | 1.0000000 | NA | NA | NA | NA | NA | 2.592491e+00 | 2.514133e+01 | 0 | 0.00 | 0.0 | 0.00 | 508 | ▇▁▁▁▁ |
| numeric | Screen.Porch | 0 | 1.0000000 | NA | NA | NA | NA | NA | 1.600205e+01 | 5.608737e+01 | 0 | 0.00 | 0.0 | 0.00 | 576 | ▇▁▁▁▁ |
| numeric | Mo.Sold | 0 | 1.0000000 | NA | NA | NA | NA | NA | 6.216041e+00 | 2.714492e+00 | 1 | 4.00 | 6.0 | 8.00 | 12 | ▅▆▇▃▃ |
| numeric | Yr.Sold | 0 | 1.0000000 | NA | NA | NA | NA | NA | 2.007790e+03 | 1.316613e+00 | 2006 | 2007.00 | 2008.0 | 2009.00 | 2010 | ▇▇▇▇▃ |
| numeric | SalePrice | 0 | 1.0000000 | NA | NA | NA | NA | NA | 1.807961e+05 | 7.988669e+04 | 12789 | 129500.00 | 160000.0 | 213500.00 | 755000 | ▇▇▁▁▁ |
3.2.3 Display summary statistics for the dataset
# Install and load the required library if not already done
library(gt)
summary(ames_data)#> Order Lot.Frontage Lot.Area Lot.Shape
#> Min. : 1.0 Min. : 21.00 Min. : 1300 Length:2930
#> 1st Qu.: 733.2 1st Qu.: 58.00 1st Qu.: 7440 Class :character
#> Median :1465.5 Median : 68.00 Median : 9436 Mode :character
#> Mean :1465.5 Mean : 69.22 Mean : 10148
#> 3rd Qu.:2197.8 3rd Qu.: 80.00 3rd Qu.: 11555
#> Max. :2930.0 Max. :313.00 Max. :215245
#> NA's :490
#> Utilities Lot.Config Land.Slope Neighborhood
#> Length:2930 Length:2930 Length:2930 Length:2930
#> Class :character Class :character Class :character Class :character
#> Mode :character Mode :character Mode :character Mode :character
#>
#>
#>
#>
#> Bldg.Type House.Style Overall.Qual Overall.Cond
#> Length:2930 Length:2930 Min. : 1.000 Min. :1.000
#> Class :character Class :character 1st Qu.: 5.000 1st Qu.:5.000
#> Mode :character Mode :character Median : 6.000 Median :5.000
#> Mean : 6.095 Mean :5.563
#> 3rd Qu.: 7.000 3rd Qu.:6.000
#> Max. :10.000 Max. :9.000
#>
#> Year.Built Year.Remod.Add Foundation Bsmt.Unf.SF
#> Min. :1872 Min. :1950 Length:2930 Min. : 0.0
#> 1st Qu.:1954 1st Qu.:1965 Class :character 1st Qu.: 219.0
#> Median :1973 Median :1993 Mode :character Median : 466.0
#> Mean :1971 Mean :1984 Mean : 559.3
#> 3rd Qu.:2001 3rd Qu.:2004 3rd Qu.: 802.0
#> Max. :2010 Max. :2010 Max. :2336.0
#> NA's :1
#> Total.Bsmt.SF BaseLivArea Central.Air X1st.Flr.SF
#> Min. : 0 Min. : 0.0 Length:2930 Min. : 334.0
#> 1st Qu.: 793 1st Qu.: 0.0 Class :character 1st Qu.: 876.2
#> Median : 990 Median : 459.5 Mode :character Median :1084.0
#> Mean :1052 Mean : 492.2 Mean :1159.6
#> 3rd Qu.:1302 3rd Qu.: 808.0 3rd Qu.:1384.0
#> Max. :6110 Max. :5644.0 Max. :5095.0
#> NA's :1
#> X2nd.Flr.SF Gr.Liv.Area Full.Bath Half.Bath
#> Min. : 0.0 Min. : 334 Min. :0.000 Min. :0.0000
#> 1st Qu.: 0.0 1st Qu.:1126 1st Qu.:1.000 1st Qu.:0.0000
#> Median : 0.0 Median :1442 Median :2.000 Median :0.0000
#> Mean : 335.5 Mean :1500 Mean :1.567 Mean :0.3795
#> 3rd Qu.: 703.8 3rd Qu.:1743 3rd Qu.:2.000 3rd Qu.:1.0000
#> Max. :2065.0 Max. :5642 Max. :4.000 Max. :2.0000
#>
#> Bathrooms Bedroom.AbvGr Kitchen.AbvGr Kitchen.Qual
#> Min. :0.000 Min. :0.000 Min. :0.000 Length:2930
#> 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:1.000 Class :character
#> Median :2.000 Median :3.000 Median :1.000 Mode :character
#> Mean :1.756 Mean :2.854 Mean :1.044
#> 3rd Qu.:2.500 3rd Qu.:3.000 3rd Qu.:1.000
#> Max. :5.000 Max. :8.000 Max. :3.000
#>
#> TotRms.AbvGrd Fireplaces Garage.Type Garage.Area
#> Min. : 2.000 Min. :0.0000 Length:2930 Min. : 0.0
#> 1st Qu.: 5.000 1st Qu.:0.0000 Class :character 1st Qu.: 320.0
#> Median : 6.000 Median :1.0000 Mode :character Median : 480.0
#> Mean : 6.443 Mean :0.5993 Mean : 472.8
#> 3rd Qu.: 7.000 3rd Qu.:1.0000 3rd Qu.: 576.0
#> Max. :15.000 Max. :4.0000 Max. :1488.0
#> NA's :1
#> Wood.Deck.SF Open.Porch.SF Enclosed.Porch X3Ssn.Porch
#> Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
#> 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
#> Median : 0.00 Median : 27.00 Median : 0.00 Median : 0.000
#> Mean : 93.75 Mean : 47.53 Mean : 23.01 Mean : 2.592
#> 3rd Qu.: 168.00 3rd Qu.: 70.00 3rd Qu.: 0.00 3rd Qu.: 0.000
#> Max. :1424.00 Max. :742.00 Max. :1012.00 Max. :508.000
#>
#> Screen.Porch Mo.Sold Yr.Sold Sale.Condition
#> Min. : 0 Min. : 1.000 Min. :2006 Length:2930
#> 1st Qu.: 0 1st Qu.: 4.000 1st Qu.:2007 Class :character
#> Median : 0 Median : 6.000 Median :2008 Mode :character
#> Mean : 16 Mean : 6.216 Mean :2008
#> 3rd Qu.: 0 3rd Qu.: 8.000 3rd Qu.:2009
#> Max. :576 Max. :12.000 Max. :2010
#>
#> SalePrice
#> Min. : 12789
#> 1st Qu.:129500
#> Median :160000
#> Mean :180796
#> 3rd Qu.:213500
#> Max. :755000
#>
3.3 Histogram of SalePrice
To understand the distribution of our target variable, SalePrice, we’ll visualize it using a histogram.
# Plotting a histogram for SalePrice
ggplot(data = ames_data, aes(x = SalePrice)) +
geom_histogram(fill = '#336699', color = "black", bins = 50) +
labs(title = "Histogram of SalePrice", x = "SalePrice", y = "Frequency") +
theme_minimal()3.4 Correlation Analysis
3.4.1 Correlation with respect to SalePrice
First, let’s identify how the continuous variables in our dataset correlate with SalePrice.
# Calculating correlations of continuous variables with SalePrice
continuous_vars <- ames_data %>% select_if(is.numeric)
correlations <- cor(continuous_vars)
saleprice_correlations <- correlations['SalePrice',]
saleprice_correlations_df <- data.frame(Variable = names(saleprice_correlations),
Correlation = saleprice_correlations) %>%
arrange(desc(Correlation))
# Plotting these correlations
ggplot(data = saleprice_correlations_df, aes(x = reorder(Variable, Correlation), y = Correlation)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Correlation of Continuous Variables with SalePrice", x = "", y = "Correlation") +
theme_minimal()#> Warning: Removed 4 rows containing missing values or values outside the scale range
#> (`geom_bar()`).
3.4.2 Correlation among Continuous Variables
Understanding how continuous variables correlate with each other can provide insights into potential multicollinearity and relationships between variables.
# Remove columns with zero variance or too many NAs
filtered_continuous_vars <- continuous_vars %>%
select_if(function(x) var(x, na.rm = TRUE) > 0 & mean(is.na(x)) < 0.5)
# Recalculate correlations
fixed_correlations <- cor(filtered_continuous_vars, use = "complete.obs")
# Replace NA with 0 in correlation matrix
fixed_correlations[is.na(fixed_correlations)] <- 0
# Plotting correlations among continuous variables
corrplot(fixed_correlations, method = "color", order = "hclust", addCoef.col = "black", tl.cex = 0.75, tl.srt = 45, type= 'lower',number.cex= 0.60)4 Data Preprocessing
Data preprocessing in the context of the Ames Housing dataset is a crucial phase where raw data is cleaned and transformed to make it suitable for modeling. It entails addressing missing values, which might indicate absent housing features, ensuring numerical attributes are on a consistent scale through feature scaling, and converting categorical attributes into a machine-readable format using encoding techniques. Additionally, the dataset might benefit from feature engineering, where new attributes are derived or existing ones are modified to better capture underlying patterns in the housing market. Lastly, a train-test split is performed, partitioning the dataset to ensure models can be trained on one subset and validated on another, allowing for an assessment of their real-world performance.
- Handling Missing Values: In the Ames Housing dataset, missing values might indicate the absence of a feature in a house (like a garage). We can impute these missing values using statistical methods, like replacing with the median for numerical features or the mode for categorical ones, ensuring our models have complete data to learn from.
- Feature Scaling: In datasets like Ames Housing, with numerous numerical features like area or age, feature scaling ensures all numerical attributes have the same scale. This is essential for algorithms like gradient descent or k-NN, where larger-scale features can disproportionately impact the model.
- Encoding Categorical Variables: The Ames dataset contains categorical variables, like neighborhood or house style. Encoding transforms these text-based categories into numerical format, often using techniques like one-hot encoding. This conversion ensures machine learning algorithms, which require numerical input, can process this data.
- Feature Engineering: Feature engineering involves creating new features from the existing ones in the Ames dataset or modifying current attributes to represent the underlying patterns better. For instance, combining ‘1st Floor Area’ and ‘2nd Floor Area’ to get ‘Total Floor Area’ might give a model a more consolidated view of a property’s size.
- Train-Test Split: To assess the performance of our models on unseen data, we divide the Ames dataset into a training set (to train the model) and a testing set (to test its predictions). A standard split might allocate 70% of the data for training and the remaining 30% for testing, ensuring our models are well-trained and well-validated.
4.1 Handling Missing Values
# List columns with missing values
missing_cols <- colnames(ames_data)[colSums(is.na(ames_data)) > 0]For simplicity, we’ll impute missing values with the median for numeric columns and the mode for categorical columns
for (col in missing_cols) {
if (is.numeric(ames_data[[col]])) {
ames_data[[col]][is.na(ames_data[[col]])] <- median(ames_data[[col]], na.rm = TRUE)
} else {
mode_val <- as.character(names(sort(table(ames_data[[col]]), decreasing = TRUE)[1]))
ames_data[[col]][is.na(ames_data[[col]])] <- mode_val
}
}4.2 Feature Scaling - Normalize numeric features
# Not this time
# numeric_cols <- sapply(ames_data, is.numeric)
# ames_data[numeric_cols] <- lapply(ames_data[numeric_cols], scale)4.3 Encoding Categorical Variables
# Not Needed for Now
# ames_data <- model.matrix(~ . - 1, data = ames_data)4.4 Feature Engineering
We won’t perform feature engineering here for simplicity, but this is where you’d create or modify features based on domain knowledge or insights from EDA.
5 5. Train-Test Split
set.seed(123) # For reproducibility
sample_index <- sample(seq_len(nrow(ames_data)), size = 0.7 * nrow(ames_data))
train_data <- ames_data[sample_index, ]
test_data <- ames_data[-sample_index, ]6 Feature Selection
6.1 Simple Linear Regression Model
Fitting a simple linear model between SalePrice and Gr.Liv.Area can be achieved using R’s lm() function.
Here’s how you can create and summarize such a model:
options(scipen=999)
# Fit the linear model on the training data
lm_model <- lm(SalePrice ~ Gr.Liv.Area, data = train_data)
# Summarize the model
# stargazer::stargazer(lm_model, type = "html", title = "Simple Linear Model", ci=TRUE, single.row = FALSE, no.space = FALSE, align = TRUE, digits=5, font.size = "small", report = "vc*stp")
summary(lm_model)#>
#> Call:
#> lm(formula = SalePrice ~ Gr.Liv.Area, data = train_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -477601 -28912 -2320 21280 335224
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 15969.885 3852.872 4.145 0.0000354 ***
#> Gr.Liv.Area 110.179 2.438 45.199 < 0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 55870 on 2049 degrees of freedom
#> Multiple R-squared: 0.4993, Adjusted R-squared: 0.499
#> F-statistic: 2043 on 1 and 2049 DF, p-value: < 0.00000000000000022
The model above tells you that for one unit increase of Gr.Liv.Area Sale Price increases by 15969.8854154 + 110.1791899= 16080.0646053 indicated by the equation SalePrice = 15969.8854154 + 110.1791899 \times Gr.Liv.Area.
6.2 Multiple Linear Regression Model
Let’s run a multiple linear regression model using all available predictors in the train_data to predict SalePrice.
Here’s how you can create and summarize such a model:
# Fit the multiple linear regression model on the training data using all predictors
full_lm_model <- lm(SalePrice ~ ., data = train_data)
# Summarize the model
stargazer::stargazer(full_lm_model, type = "html", title = "Full Linear Model", ci=TRUE, single.row = T, no.space = FALSE, align = TRUE, digits=5, font.size = "small", report = "vc*stp")| Dependent variable: | |
| SalePrice | |
| Order | 4.61917 (-12.05823, 21.29657) |
| t = 0.54286 | |
| p = 0.58730 | |
| Lot.Frontage | -95.08760** (-177.84030, -12.33486) |
| t = -2.25211 | |
| p = 0.02443 | |
| Lot.Area | 0.53936*** (0.34036, 0.73837) |
| t = 5.31221 | |
| p = 0.0000002 | |
| Lot.ShapeIR2 | 9,703.43300** (1,053.18600, 18,353.68000) |
| t = 2.19859 | |
| p = 0.02803 | |
| Lot.ShapeIR3 | -27,454.21000*** (-43,011.15000, -11,897.27000) |
| t = -3.45886 | |
| p = 0.00056 | |
| Lot.ShapeReg | 1,889.03000 (-1,219.37300, 4,997.43400) |
| t = 1.19110 | |
| p = 0.23376 | |
| UtilitiesNoSewr | -4,265.43600 (-44,714.14000, 36,183.27000) |
| t = -0.20668 | |
| p = 0.83628 | |
| Lot.ConfigCulDSac | 8,537.96900** (2,024.38900, 15,051.55000) |
| t = 2.56911 | |
| p = 0.01027 | |
| Lot.ConfigFR2 | -4,781.72300 (-12,971.42000, 3,407.97300) |
| t = -1.14437 | |
| p = 0.25262 | |
| Lot.ConfigFR3 | 2,133.61500 (-15,215.00000, 19,482.23000) |
| t = 0.24105 | |
| p = 0.80955 | |
| Lot.ConfigInside | 1,494.17000 (-2,029.32400, 5,017.66400) |
| t = 0.83114 | |
| p = 0.40600 | |
| Land.SlopeMod | 11,062.63000*** (4,526.68100, 17,598.58000) |
| t = 3.31740 | |
| p = 0.00093 | |
| Land.SlopeSev | -8,812.87300 (-26,974.91000, 9,349.16300) |
| t = -0.95104 | |
| p = 0.34170 | |
| NeighborhoodBlueste | -710.49220 (-27,308.48000, 25,887.49000) |
| t = -0.05236 | |
| p = 0.95826 | |
| NeighborhoodBrDale | 3,728.84400 (-16,239.42000, 23,697.11000) |
| t = 0.36600 | |
| p = 0.71441 | |
| NeighborhoodBrkSide | -13,863.09000 (-31,410.40000, 3,684.21300) |
| t = -1.54845 | |
| p = 0.12168 | |
| NeighborhoodClearCr | -14,493.15000 (-33,763.86000, 4,777.56000) |
| t = -1.47405 | |
| p = 0.14063 | |
| NeighborhoodCollgCr | -13,063.47000 (-29,258.25000, 3,131.30000) |
| t = -1.58100 | |
| p = 0.11404 | |
| NeighborhoodCrawfor | 6,462.82800 (-11,264.91000, 24,190.57000) |
| t = 0.71452 | |
| p = 0.47499 | |
| NeighborhoodEdwards | -27,054.07000*** (-43,879.12000, -10,229.03000) |
| t = -3.15155 | |
| p = 0.00165 | |
| NeighborhoodGilbert | -18,456.61000** (-34,129.62000, -2,783.61200) |
| t = -2.30806 | |
| p = 0.02110 | |
| NeighborhoodGreens | 16,971.38000 (-9,295.00100, 43,237.76000) |
| t = 1.26638 | |
| p = 0.20553 | |
| NeighborhoodGrnHill | 133,832.70000*** (75,826.81000, 191,838.50000) |
| t = 4.52208 | |
| p = 0.00001 | |
| NeighborhoodIDOTRR | -20,440.44000** (-39,059.21000, -1,821.67300) |
| t = -2.15173 | |
| p = 0.03155 | |
| NeighborhoodLandmrk | 6,702.62500 (-51,973.81000, 65,379.06000) |
| t = 0.22389 | |
| p = 0.82287 | |
| NeighborhoodMeadowV | -3,716.85500 (-23,404.00000, 15,970.29000) |
| t = -0.37003 | |
| p = 0.71140 | |
| NeighborhoodMitchel | -18,937.37000** (-36,763.47000, -1,111.27000) |
| t = -2.08215 | |
| p = 0.03746 | |
| NeighborhoodNAmes | -16,121.85000** (-31,760.05000, -483.64530) |
| t = -2.02058 | |
| p = 0.04346 | |
| NeighborhoodNoRidge | 40,872.70000*** (23,658.26000, 58,087.14000) |
| t = 4.65359 | |
| p = 0.000004 | |
| NeighborhoodNPkVill | 2,299.06900 (-19,268.72000, 23,866.86000) |
| t = 0.20893 | |
| p = 0.83453 | |
| NeighborhoodNridgHt | 36,664.38000*** (21,663.44000, 51,665.32000) |
| t = 4.79042 | |
| p = 0.000002 | |
| NeighborhoodNWAmes | -18,966.09000** (-34,931.51000, -3,000.68100) |
| t = -2.32834 | |
| p = 0.02000 | |
| NeighborhoodOldTown | -20,682.09000** (-37,682.38000, -3,681.79900) |
| t = -2.38444 | |
| p = 0.01721 | |
| NeighborhoodSawyer | -17,915.49000** (-34,431.13000, -1,399.86000) |
| t = -2.12609 | |
| p = 0.03363 | |
| NeighborhoodSawyerW | -17,652.24000** (-33,555.59000, -1,748.87900) |
| t = -2.17550 | |
| p = 0.02972 | |
| NeighborhoodSomerst | 8,538.23000 (-6,296.94600, 23,373.41000) |
| t = 1.12804 | |
| p = 0.25945 | |
| NeighborhoodStoneBr | 48,478.99000*** (31,964.51000, 64,993.48000) |
| t = 5.75356 | |
| p = 0.00000 | |
| NeighborhoodSWISU | -19,674.16000** (-39,302.01000, -46.30389) |
| t = -1.96459 | |
| p = 0.04961 | |
| NeighborhoodTimber | -2,105.89400 (-20,421.87000, 16,210.09000) |
| t = -0.22535 | |
| p = 0.82174 | |
| NeighborhoodVeenker | 2,179.58100 (-17,309.48000, 21,668.64000) |
| t = 0.21919 | |
| p = 0.82653 | |
| Bldg.Type2fmCon | -6,896.65700 (-17,014.84000, 3,221.52400) |
| t = -1.33593 | |
| p = 0.18173 | |
| Bldg.TypeDuplex | -11.75693 (-12,266.83000, 12,243.32000) |
| t = -0.00188 | |
| p = 0.99850 | |
| Bldg.TypeTwnhs | -41,073.04000*** (-51,866.13000, -30,279.95000) |
| t = -7.45863 | |
| p = 0.00000 | |
| Bldg.TypeTwnhsE | -29,786.06000*** (-36,684.70000, -22,887.42000) |
| t = -8.46248 | |
| p = 0.00000 | |
| House.Style1.5Unf | 6,206.55100 (-10,207.87000, 22,620.97000) |
| t = 0.74109 | |
| p = 0.45873 | |
| House.Style1Story | 8,663.58800** (1,493.23100, 15,833.94000) |
| t = 2.36813 | |
| p = 0.01798 | |
| House.Style2.5Fin | -16,788.65000 (-44,887.53000, 11,310.23000) |
| t = -1.17105 | |
| p = 0.24173 | |
| House.Style2.5Unf | -571.31310 (-16,207.44000, 15,064.81000) |
| t = -0.07161 | |
| p = 0.94292 | |
| House.Style2Story | -7,381.70000** (-13,470.43000, -1,292.97100) |
| t = -2.37617 | |
| p = 0.01759 | |
| House.StyleSFoyer | 8,663.96200 (-1,846.15800, 19,174.08000) |
| t = 1.61569 | |
| p = 0.10633 | |
| House.StyleSLvl | 1,382.75600 (-7,339.72100, 10,105.23000) |
| t = 0.31071 | |
| p = 0.75606 | |
| Overall.Qual | 11,535.13000*** (9,748.80900, 13,321.45000) |
| t = 12.65643 | |
| p = 0.00000 | |
| Overall.Cond | 5,122.06300*** (3,640.09500, 6,604.03200) |
| t = 6.77414 | |
| p = 0.00000 | |
| Year.Built | 364.55260*** (236.31640, 492.78870) |
| t = 5.57183 | |
| p = 0.0000001 | |
| Year.Remod.Add | 83.40910 (-17.11190, 183.93010) |
| t = 1.62632 | |
| p = 0.10405 | |
| FoundationCBlock | -1,416.79500 (-7,247.69000, 4,414.10000) |
| t = -0.47623 | |
| p = 0.63397 | |
| FoundationPConc | 4,860.30800 (-1,620.57400, 11,341.19000) |
| t = 1.46987 | |
| p = 0.14176 | |
| FoundationSlab | -2,454.59600 (-16,302.93000, 11,393.74000) |
| t = -0.34740 | |
| p = 0.72833 | |
| FoundationStone | 9,785.99200 (-9,763.33900, 29,335.32000) |
| t = 0.98112 | |
| p = 0.32666 | |
| FoundationWood | -13,077.21000 (-43,087.81000, 16,933.39000) |
| t = -0.85406 | |
| p = 0.39318 | |
| Bsmt.Unf.SF | 20.08809 (-92.88957, 133.06580) |
| t = 0.34849 | |
| p = 0.72751 | |
| Total.Bsmt.SF | -14.90117 (-127.87620, 98.07386) |
| t = -0.25852 | |
| p = 0.79604 | |
| BaseLivArea | 37.94264 (-74.94773, 150.83300) |
| t = 0.65875 | |
| p = 0.51014 | |
| Central.AirY | -2,331.77700 (-8,722.74500, 4,059.19000) |
| t = -0.71510 | |
| p = 0.47464 | |
| X1st.Flr.SF | 11.16909 (-21.56201, 43.90020) |
| t = 0.66881 | |
| p = 0.50370 | |
| X2nd.Flr.SF | 21.20470 (-10.27377, 52.68317) |
| t = 1.32028 | |
| p = 0.18690 | |
| Gr.Liv.Area | 27.46222* (-4.73022, 59.65466) |
| t = 1.67198 | |
| p = 0.09469 | |
| Full.Bath | 6,683.69600*** (2,746.96700, 10,620.43000) |
| t = 3.32759 | |
| p = 0.00090 | |
| Half.Bath | 2,363.51700 (-1,456.97200, 6,184.00600) |
| t = 1.21252 | |
| p = 0.22547 | |
| Bathrooms | |
| Bedroom.AbvGr | -2,775.97400** (-5,249.07300, -302.87490) |
| t = -2.20000 | |
| p = 0.02793 | |
| Kitchen.AbvGr | -13,051.86000** (-24,242.05000, -1,861.68000) |
| t = -2.28604 | |
| p = 0.02236 | |
| Kitchen.QualFa | -35,430.56000*** (-46,768.31000, -24,092.82000) |
| t = -6.12490 | |
| p = 0.00000 | |
| Kitchen.QualGd | -37,934.88000*** (-44,089.20000, -31,780.56000) |
| t = -12.08111 | |
| p = 0.00000 | |
| Kitchen.QualTA | -39,483.22000*** (-46,580.57000, -32,385.87000) |
| t = -10.90346 | |
| p = 0.00000 | |
| TotRms.AbvGrd | 1,553.69400* (-174.07630, 3,281.46500) |
| t = 1.76249 | |
| p = 0.07815 | |
| Fireplaces | 5,243.89500*** (2,730.50000, 7,757.29000) |
| t = 4.08923 | |
| p = 0.00005 | |
| Garage.TypeAttchd | 10,760.05000 (-4,364.42100, 25,884.51000) |
| t = 1.39438 | |
| p = 0.16337 | |
| Garage.TypeBasment | 15,279.79000 (-4,558.76000, 35,118.34000) |
| t = 1.50958 | |
| p = 0.13132 | |
| Garage.TypeBuiltIn | 15,600.90000* (-697.97820, 31,899.77000) |
| t = 1.87603 | |
| p = 0.06080 | |
| Garage.TypeCarPort | 5,377.45300 (-18,086.58000, 28,841.49000) |
| t = 0.44918 | |
| p = 0.65336 | |
| Garage.TypeDetchd | 8,604.57100 (-6,408.82300, 23,617.96000) |
| t = 1.12331 | |
| p = 0.26145 | |
| Garage.Area | 25.00584*** (16.00345, 34.00823) |
| t = 5.44417 | |
| p = 0.0000001 | |
| Wood.Deck.SF | 4.67919 (-6.46395, 15.82234) |
| t = 0.82302 | |
| p = 0.41060 | |
| Open.Porch.SF | -4.84744 (-26.26057, 16.56569) |
| t = -0.44369 | |
| p = 0.65732 | |
| Enclosed.Porch | 11.40326 (-9.74494, 32.55147) |
| t = 1.05683 | |
| p = 0.29073 | |
| X3Ssn.Porch | 40.87324 (-17.08306, 98.82953) |
| t = 1.38225 | |
| p = 0.16706 | |
| Screen.Porch | 60.57027*** (37.69270, 83.44783) |
| t = 5.18917 | |
| p = 0.0000003 | |
| Mo.Sold | -340.01420 (-809.94220, 129.91380) |
| t = -1.41812 | |
| p = 0.15632 | |
| Yr.Sold | 2,352.76300 (-8,161.50600, 12,867.03000) |
| t = 0.43858 | |
| p = 0.66102 | |
| Sale.ConditionAdjLand | 3,075.75500 (-19,059.80000, 25,211.31000) |
| t = 0.27234 | |
| p = 0.78540 | |
| Sale.ConditionAlloca | 3,523.68300 (-11,356.82000, 18,404.18000) |
| t = 0.46412 | |
| p = 0.64262 | |
| Sale.ConditionFamily | 6,487.67400 (-5,044.99900, 18,020.35000) |
| t = 1.10257 | |
| p = 0.27035 | |
| Sale.ConditionNormal | 7,880.08300*** (2,538.89900, 13,221.27000) |
| t = 2.89162 | |
| p = 0.00388 | |
| Sale.ConditionPartial | 21,144.76000*** (13,743.88000, 28,545.65000) |
| t = 5.59973 | |
| p = 0.0000001 | |
| Constant | -5,597,142.00000 (-26,756,346.00000, 15,562,063.00000) |
| t = -0.51846 | |
| p = 0.60420 | |
| Observations | 2,051 |
| R2 | 0.87716 |
| Adjusted R2 | 0.87126 |
| Residual Std. Error | 28,321.30000 (df = 1956) |
| F Statistic | 148.59050*** (df = 94; 1956) |
| Note: | p<0.1; p<0.05; p<0.01 |
6.2.1 Regression Diagnostic
Once a linear regression model is fitted, there are four primary assumptions that we should check:
- Linearity: The relationship between the predictors and the response variable should be linear.
- Independence: The residuals should be independent, especially in time series data.
- Homoscedasticity (Constant Variance): The variance of the residuals should be constant across all levels of the independent variables.
- Normality: The residuals should be approximately normally distributed. We commonly use diagnostic plots to diagnose potential violations of these assumptions in R. Here’s a breakdown of these plots and their descriptions:
6.2.1.1 Residual vs. Fitted (or Predicted) Values:
This plot helps check the assumptions of linearity and equal variance (homoscedasticity).
Description: A horizontal band around the 0 reference line indicates that the relationship is linear and the variances are equal. Patterns or trends suggest violations.
plot(full_lm_model, which=1)6.2.1.2 Normal Q-Q Plot:
This plot checks the assumption of normality of the residuals.
Description: If the residuals are normally distributed, they should fall on a roughly straight line at a 45-degree angle. Deviations from this line indicate departures from normality.
plot(full_lm_model, which=2)#> Warning: not plotting observations with leverage one:
#> 94, 541, 1048
6.2.1.3 Scale-Location (or Spread-Location) Plot:
This plot helps check the assumption of equal variance (homoscedasticity).
Description: A horizontal band with equally spread points indicates equal variances. A funnel shape (either narrow at the bottom or top) suggests that the variances change with the fitted values, violating the assumption.
plot(full_lm_model, which=3)#> Warning: not plotting observations with leverage one:
#> 94, 541, 1048
6.2.1.4 Residuals vs. Leverage:
This plot helps identify influential case observations that have an undue influence on the regression equation.
Description: Points that stand out, especially those in the plot’s top right or bottom right, are influential for the regression equation. The Cook’s distance lines can help identify significant observations.
plot(full_lm_model, which=5)#> Warning: not plotting observations with leverage one:
#> 94, 541, 1048
Remember, no real-world data will perfectly meet all these assumptions. The key is to identify major deviations that could bias the regression results. If any of these assumptions appear to be violated, further investigation and potentially other modeling approaches or transformations might be necessary.
6.2.1.5 Cook’s Distance
Cook’s Distance is a measure used in regression analysis to identify influential data points. An influential data point is an observation or set of observations that notably affects the regression equation. Such points can exert undue leverage on the estimated regression coefficients, potentially skewing the model.
Concept: Cook’s Distance combines the leverage (how extreme the input data is) and the residuals (how extreme the output is) into a single measure.
The measure quantifies how much all the fitted values would change if we excluded a particular observation from the dataset.
Calculation: It is computed for each observation and represents the scaled change in the fitted values when the observation is left out of the regression. Mathematically, it’s a function of each observation’s residual and leverage.
Interpretation: A common rule of thumb is that any observation with a Cook’s Distance more significant than one might be influential. However, this threshold might be too strict in practice, especially for larger datasets.
Another common practice is to look for points whose Cook’s Distance exceeds four times the mean of all Cook’s Distance values.
A large Cook’s Distance value indicates that the observation strongly influences the regression coefficients. Removing this observation would significantly change the estimated regression line.
It’s crucial not to automatically exclude data points based on Cook’s Distance alone. If an observation has a high Cook’s Distance, one should investigate the reasons for its influence. It could result from data entry or measurement errors or represent a legitimate, interesting outlier.
plot(full_lm_model, which=4)6.3 Variable Selection Techniques in Multiple Regression
6.3.1 Forward Selection:
- Starting Point: Begins with no predictors in the model.
- Procedure: In each step, the predictor that results in the lowest residual sum of squares (RSS) when added to the model is included.
- Stopping Point: Continues until a pre-specified stopping rule is met, like a certain number of variables, or until adding predictors no longer improves the model by a significant amount.
- Advantages: Computationally efficient compared to best subset selection, especially when the number of predictors is large.
# Load necessary library
library(MASS)
# Fit the base model (with only the intercept)
base_model <- lm(SalePrice ~ 1, data=train_data)
# Perform forward selection
forward_selected_model <- step(base_model,
scope=list(lower=~1,
upper=~Order + Lot.Frontage + Lot.Area + Lot.Shape + Utilities +
Lot.Config + Land.Slope + Neighborhood + Bldg.Type + House.Style + Overall.Qual + Overall.Cond + Year.Built + Year.Remod.Add + Foundation + Bsmt.Unf.SF + Total.Bsmt.SF + BaseLivArea +
Central.Air + X1st.Flr.SF + X2nd.Flr.SF + Gr.Liv.Area + Full.Bath + Half.Bath + Bathrooms + Bedroom.AbvGr + Kitchen.AbvGr + TotRms.AbvGrd + Fireplaces + Garage.Type +
Garage.Area + Wood.Deck.SF + Open.Porch.SF + Enclosed.Porch +
X3Ssn.Porch + Screen.Porch + Mo.Sold + Yr.Sold + Sale.Condition),
direction="forward",
trace=0) # trace=1 provides a step-by-step output
# View the final model
stargazer::stargazer(forward_selected_model, type = "html", out = "regression.html" ,title = "Simple Linear Model", ci=TRUE, single.row = TRUE, no.space = FALSE, align = TRUE, digits=5, font.size = "small", report = "vc*stp")| Dependent variable: | |
| SalePrice | |
| Overall.Qual | 13,479.37000*** (11,692.66000, 15,266.07000) |
| t = 14.78649 | |
| p = 0.00000 | |
| Gr.Liv.Area | 41.52258*** (33.27860, 49.76656) |
| t = 9.87178 | |
| p = 0.00000 | |
| NeighborhoodBlueste | 2,843.10700 (-23,729.50000, 29,415.71000) |
| t = 0.20970 | |
| p = 0.83393 | |
| NeighborhoodBrDale | 8,739.89200 (-11,680.53000, 29,160.31000) |
| t = 0.83886 | |
| p = 0.40165 | |
| NeighborhoodBrkSide | -11,472.78000 (-29,118.76000, 6,173.19300) |
| t = -1.27430 | |
| p = 0.20271 | |
| NeighborhoodClearCr | -13,620.00000 (-32,801.76000, 5,561.75800) |
| t = -1.39167 | |
| p = 0.16418 | |
| NeighborhoodCollgCr | -11,981.53000 (-27,428.29000, 3,465.23400) |
| t = -1.52028 | |
| p = 0.12861 | |
| NeighborhoodCrawfor | 8,512.60900 (-8,324.81200, 25,350.03000) |
| t = 0.99091 | |
| p = 0.32185 | |
| NeighborhoodEdwards | -22,691.78000*** (-38,931.64000, -6,451.92000) |
| t = -2.73864 | |
| p = 0.00623 | |
| NeighborhoodGilbert | -20,599.96000** (-36,717.68000, -4,482.23400) |
| t = -2.50502 | |
| p = 0.01233 | |
| NeighborhoodGreens | 15,846.29000 (-11,110.37000, 42,802.95000) |
| t = 1.15215 | |
| p = 0.24940 | |
| NeighborhoodGrnHill | 129,888.20000*** (70,614.59000, 189,161.80000) |
| t = 4.29493 | |
| p = 0.00002 | |
| NeighborhoodIDOTRR | -15,758.82000* (-33,892.39000, 2,374.76400) |
| t = -1.70329 | |
| p = 0.08868 | |
| NeighborhoodLandmrk | 10,907.48000 (-49,333.66000, 71,148.62000) |
| t = 0.35488 | |
| p = 0.72272 | |
| NeighborhoodMeadowV | 6,726.82500 (-11,876.50000, 25,330.15000) |
| t = 0.70871 | |
| p = 0.47859 | |
| NeighborhoodMitchel | -17,512.17000** (-33,927.43000, -1,096.90400) |
| t = -2.09093 | |
| p = 0.03667 | |
| NeighborhoodNAmes | -14,891.41000* (-30,827.83000, 1,045.00900) |
| t = -1.83144 | |
| p = 0.06719 | |
| NeighborhoodNoRidge | 36,220.40000*** (18,513.70000, 53,927.10000) |
| t = 4.00926 | |
| p = 0.00007 | |
| NeighborhoodNPkVill | 4,549.64400 (-17,547.67000, 26,646.96000) |
| t = 0.40354 | |
| p = 0.68660 | |
| NeighborhoodNridgHt | 48,342.12000*** (33,032.92000, 63,651.32000) |
| t = 6.18901 | |
| p = 0.00000 | |
| NeighborhoodNWAmes | -20,007.26000** (-36,442.10000, -3,572.42800) |
| t = -2.38600 | |
| p = 0.01713 | |
| NeighborhoodOldTown | -17,281.65000** (-34,403.30000, -159.98980) |
| t = -1.97828 | |
| p = 0.04804 | |
| NeighborhoodSawyer | -15,950.72000* (-32,518.69000, 617.25640) |
| t = -1.88694 | |
| p = 0.05932 | |
| NeighborhoodSawyerW | -19,437.82000** (-35,604.88000, -3,270.75400) |
| t = -2.35648 | |
| p = 0.01855 | |
| NeighborhoodSomerst | 8,589.75300 (-6,587.08300, 23,766.59000) |
| t = 1.10930 | |
| p = 0.26744 | |
| NeighborhoodStoneBr | 53,246.43000*** (36,270.27000, 70,222.60000) |
| t = 6.14751 | |
| p = 0.00000 | |
| NeighborhoodSWISU | -16,388.01000* (-35,511.23000, 2,735.21600) |
| t = -1.67963 | |
| p = 0.09319 | |
| NeighborhoodTimber | -11.38951 (-17,011.67000, 16,988.89000) |
| t = -0.00131 | |
| p = 0.99896 | |
| NeighborhoodVeenker | 3,344.23400 (-16,698.45000, 23,386.92000) |
| t = 0.32703 | |
| p = 0.74368 | |
| BaseLivArea | 25.57887*** (19.13377, 32.02396) |
| t = 7.77857 | |
| p = 0.00000 | |
| Bldg.Type2fmCon | -5,936.95600 (-16,252.66000, 4,378.75100) |
| t = -1.12801 | |
| p = 0.25946 | |
| Bldg.TypeDuplex | -2,557.69000 (-15,094.53000, 9,979.15300) |
| t = -0.39986 | |
| p = 0.68931 | |
| Bldg.TypeTwnhs | -45,079.30000*** (-56,046.87000, -34,111.73000) |
| t = -8.05591 | |
| p = 0.00000 | |
| Bldg.TypeTwnhsE | -32,478.51000*** (-39,499.60000, -25,457.43000) |
| t = -9.06650 | |
| p = 0.00000 | |
| Year.Built | 372.81780*** (248.85920, 496.77640) |
| t = 5.89478 | |
| p = 0.00000 | |
| Sale.ConditionAdjLand | 3,798.87300 (-18,793.65000, 26,391.39000) |
| t = 0.32956 | |
| p = 0.74177 | |
| Sale.ConditionAlloca | 1,605.97800 (-13,556.25000, 16,768.21000) |
| t = 0.20760 | |
| p = 0.83557 | |
| Sale.ConditionFamily | 6,843.60600 (-5,023.50300, 18,710.71000) |
| t = 1.13029 | |
| p = 0.25850 | |
| Sale.ConditionNormal | 6,388.35500** (937.03400, 11,839.68000) |
| t = 2.29686 | |
| p = 0.02174 | |
| Sale.ConditionPartial | 23,969.80000*** (16,429.00000, 31,510.59000) |
| t = 6.23010 | |
| p = 0.00000 | |
| Overall.Cond | 5,168.43100*** (3,703.83200, 6,633.03100) |
| t = 6.91652 | |
| p = 0.00000 | |
| Lot.Area | 0.52338*** (0.31905, 0.72772) |
| t = 5.02035 | |
| p = 0.000001 | |
| Screen.Porch | 59.51841*** (36.20029, 82.83654) |
| t = 5.00272 | |
| p = 0.000001 | |
| Garage.Area | 21.44998*** (12.96982, 29.93013) |
| t = 4.95759 | |
| p = 0.000001 | |
| House.Style1.5Unf | 10,496.06000 (-6,278.54600, 27,270.67000) |
| t = 1.22637 | |
| p = 0.22021 | |
| House.Style1Story | 11,751.13000*** (4,880.87600, 18,621.38000) |
| t = 3.35239 | |
| p = 0.00082 | |
| House.Style2.5Fin | -27,978.63000** (-55,129.02000, -828.24650) |
| t = -2.01975 | |
| p = 0.04355 | |
| House.Style2.5Unf | -5,474.84900 (-21,321.87000, 10,372.17000) |
| t = -0.67713 | |
| p = 0.49841 | |
| House.Style2Story | -7,494.88700** (-13,589.77000, -1,400.00100) |
| t = -2.41017 | |
| p = 0.01604 | |
| House.StyleSFoyer | 13,090.54000** (2,591.15400, 23,589.93000) |
| t = 2.44367 | |
| p = 0.01463 | |
| House.StyleSLvl | 3,742.10200 (-4,796.90700, 12,281.11000) |
| t = 0.85893 | |
| p = 0.39049 | |
| Lot.ShapeIR2 | 8,123.22900* (-802.94190, 17,049.40000) |
| t = 1.78366 | |
| p = 0.07464 | |
| Lot.ShapeIR3 | -29,048.06000*** (-45,007.06000, -13,089.06000) |
| t = -3.56746 | |
| p = 0.00037 | |
| Lot.ShapeReg | 1,528.94500 (-1,664.16800, 4,722.05800) |
| t = 0.93848 | |
| p = 0.34812 | |
| Fireplaces | 5,202.51400*** (2,643.28700, 7,761.74000) |
| t = 3.98431 | |
| p = 0.00008 | |
| Bathrooms | 5,996.61700*** (2,018.21700, 9,975.01700) |
| t = 2.95424 | |
| p = 0.00318 | |
| Lot.Frontage | -84.44456* (-168.94340, 0.05431) |
| t = -1.95870 | |
| p = 0.05029 | |
| Year.Remod.Add | 97.99938* (-2.04115, 198.03990) |
| t = 1.91997 | |
| p = 0.05501 | |
| Land.SlopeMod | 11,074.44000*** (4,382.30600, 17,766.56000) |
| t = 3.24344 | |
| p = 0.00121 | |
| Land.SlopeSev | -5,074.33200 (-23,705.70000, 13,557.03000) |
| t = -0.53380 | |
| p = 0.59354 | |
| Bedroom.AbvGr | -3,945.51400*** (-6,436.26200, -1,454.76600) |
| t = -3.10472 | |
| p = 0.00194 | |
| TotRms.AbvGrd | 2,363.26400*** (586.93740, 4,139.59100) |
| t = 2.60758 | |
| p = 0.00919 | |
| Kitchen.AbvGr | -13,944.52000** (-25,444.97000, -2,444.08200) |
| t = -2.37650 | |
| p = 0.01758 | |
| FoundationCBlock | -2,626.30700 (-8,581.08400, 3,328.47000) |
| t = -0.86443 | |
| p = 0.38746 | |
| FoundationPConc | 4,503.15500 (-2,144.87500, 11,151.18000) |
| t = 1.32761 | |
| p = 0.18446 | |
| FoundationSlab | 1,685.56400 (-12,309.34000, 15,680.47000) |
| t = 0.23606 | |
| p = 0.81342 | |
| FoundationStone | 8,073.04400 (-12,024.60000, 28,170.69000) |
| t = 0.78730 | |
| p = 0.43121 | |
| FoundationWood | -15,500.73000 (-45,764.57000, 14,763.11000) |
| t = -1.00387 | |
| p = 0.31557 | |
| X2nd.Flr.SF | 12.70646** (2.11361, 23.29932) |
| t = 2.35104 | |
| p = 0.01882 | |
| Lot.ConfigCulDSac | 8,242.40000** (1,526.96600, 14,957.83000) |
| t = 2.40562 | |
| p = 0.01624 | |
| Lot.ConfigFR2 | -4,743.18900 (-13,107.54000, 3,621.16500) |
| t = -1.11144 | |
| p = 0.26652 | |
| Lot.ConfigFR3 | -1,202.19600 (-18,367.77000, 15,963.38000) |
| t = -0.13727 | |
| p = 0.89084 | |
| Lot.ConfigInside | 1,917.46000 (-1,705.27200, 5,540.19100) |
| t = 1.03738 | |
| p = 0.29969 | |
| X3Ssn.Porch | 49.63295 (-10.06617, 109.33210) |
| t = 1.62948 | |
| p = 0.10337 | |
| Bsmt.Unf.SF | 5.39633 (-1.24957, 12.04224) |
| t = 1.59145 | |
| p = 0.11167 | |
| Constant | -960,652.70000*** (-1,231,564.00000, -689,741.90000) |
| t = -6.95005 | |
| p = 0.00000 | |
| Observations | 2,051 |
| R2 | 0.86704 |
| Adjusted R2 | 0.86200 |
| Residual Std. Error | 29,322.53000 (df = 1975) |
| F Statistic | 171.72850*** (df = 75; 1975) |
| Note: | p<0.1; p<0.05; p<0.01 |
6.3.2 Backward Elimination:
- Starting Point: Begins with all predictors in the model.
- Procedure: In each step, the predictor that is least significant (i.e., has the highest p-value) and does not contribute significantly to the model fit is removed.
- Stopping Point: Continues until a stopping rule is met, often when all remaining predictors are significant at a specified alpha level.
- Advantages: Like forward selection, it’s more computationally efficient than best subset selection.
# Load necessary library
library(MASS)
# Fit the full model (with all predictors)
full_model <- lm(SalePrice ~ Order + Lot.Frontage + Lot.Area + Lot.Shape + Utilities +
Lot.Config + Land.Slope + Neighborhood + Bldg.Type + House.Style +
Overall.Qual + Overall.Cond + Year.Built + Year.Remod.Add +
Foundation + Bsmt.Unf.SF + Total.Bsmt.SF + BaseLivArea +
Central.Air + X1st.Flr.SF + X2nd.Flr.SF + Gr.Liv.Area + Full.Bath +
Half.Bath + Bathrooms + Bedroom.AbvGr + Kitchen.AbvGr +
TotRms.AbvGrd + Fireplaces + Garage.Type +
Garage.Area + Wood.Deck.SF + Open.Porch.SF + Enclosed.Porch +
X3Ssn.Porch + Screen.Porch + Mo.Sold + Yr.Sold + Sale.Condition,
data=train_data)
# Perform backward selection
backward_selected_model <- step(full_model,
direction="backward",
trace=0) # trace=1 provides a step-by-step output
# View the final model
stargazer::stargazer(backward_selected_model, type = "html", out = "regression.html" ,title = "Simple Linear Model", ci=TRUE, single.row = TRUE, no.space = FALSE, align = TRUE, digits=5, font.size = "small", report = "vc*stp")| Dependent variable: | |
| SalePrice | |
| Lot.Frontage | -84.42088* (-168.94220, 0.10045) |
| t = -1.95763 | |
| p = 0.05042 | |
| Lot.Area | 0.52307*** (0.31859, 0.72755) |
| t = 5.01379 | |
| p = 0.000001 | |
| Lot.ShapeIR2 | 8,117.05100* (-812.18100, 17,046.28000) |
| t = 1.78169 | |
| p = 0.07496 | |
| Lot.ShapeIR3 | -29,053.66000*** (-45,017.04000, -13,090.28000) |
| t = -3.56717 | |
| p = 0.00037 | |
| Lot.ShapeReg | 1,530.75000 (-1,663.36000, 4,724.86000) |
| t = 0.93930 | |
| p = 0.34770 | |
| Lot.ConfigCulDSac | 8,251.54900** (1,532.03200, 14,971.07000) |
| t = 2.40683 | |
| p = 0.01619 | |
| Lot.ConfigFR2 | -4,731.23600 (-13,100.98000, 3,638.50300) |
| t = -1.10793 | |
| p = 0.26803 | |
| Lot.ConfigFR3 | -1,200.73600 (-18,370.64000, 15,969.16000) |
| t = -0.13707 | |
| p = 0.89100 | |
| Lot.ConfigInside | 1,915.98800 (-1,707.76700, 5,539.74300) |
| t = 1.03629 | |
| p = 0.30020 | |
| Land.SlopeMod | 11,087.45000*** (4,388.77000, 17,786.12000) |
| t = 3.24407 | |
| p = 0.00120 | |
| Land.SlopeSev | -5,028.89200 (-23,686.26000, 13,628.47000) |
| t = -0.52829 | |
| p = 0.59736 | |
| NeighborhoodBlueste | 2,819.96000 (-23,763.19000, 29,403.11000) |
| t = 0.20791 | |
| p = 0.83532 | |
| NeighborhoodBrDale | 8,823.58100 (-11,667.90000, 29,315.06000) |
| t = 0.84396 | |
| p = 0.39880 | |
| NeighborhoodBrkSide | -11,468.41000 (-29,119.02000, 6,182.19700) |
| t = -1.27348 | |
| p = 0.20300 | |
| NeighborhoodClearCr | -13,632.73000 (-32,820.92000, 5,555.46800) |
| t = -1.39250 | |
| p = 0.16393 | |
| NeighborhoodCollgCr | -11,971.54000 (-27,423.42000, 3,480.33600) |
| t = -1.51851 | |
| p = 0.12905 | |
| NeighborhoodCrawfor | 8,524.80200 (-8,318.54200, 25,368.14000) |
| t = 0.99198 | |
| p = 0.32133 | |
| NeighborhoodEdwards | -22,681.15000*** (-38,926.42000, -6,435.87600) |
| t = -2.73644 | |
| p = 0.00627 | |
| NeighborhoodGilbert | -20,581.13000** (-36,707.13000, -4,455.13000) |
| t = -2.50144 | |
| p = 0.01245 | |
| NeighborhoodGreens | 15,836.51000 (-11,127.59000, 42,800.60000) |
| t = 1.15112 | |
| p = 0.24983 | |
| NeighborhoodGrnHill | 130,061.10000*** (70,675.59000, 189,446.60000) |
| t = 4.29255 | |
| p = 0.00002 | |
| NeighborhoodIDOTRR | -15,760.57000* (-33,898.73000, 2,377.58600) |
| t = -1.70305 | |
| p = 0.08872 | |
| NeighborhoodLandmrk | 10,970.53000 (-49,298.42000, 71,239.48000) |
| t = 0.35676 | |
| p = 0.72131 | |
| NeighborhoodMeadowV | 6,785.24800 (-11,858.04000, 25,428.54000) |
| t = 0.71333 | |
| p = 0.47573 | |
| NeighborhoodMitchel | -17,505.57000** (-33,925.46000, -1,085.68400) |
| t = -2.08956 | |
| p = 0.03679 | |
| NeighborhoodNAmes | -14,862.30000* (-30,812.96000, 1,088.35000) |
| t = -1.82623 | |
| p = 0.06797 | |
| NeighborhoodNoRidge | 36,241.03000*** (18,525.26000, 53,956.79000) |
| t = 4.00949 | |
| p = 0.00007 | |
| NeighborhoodNPkVill | 4,562.31700 (-17,541.94000, 26,666.57000) |
| t = 0.40454 | |
| p = 0.68587 | |
| NeighborhoodNridgHt | 48,350.85000*** (33,036.85000, 63,664.84000) |
| t = 6.18819 | |
| p = 0.00000 | |
| NeighborhoodNWAmes | -19,989.76000** (-36,432.31000, -3,547.22000) |
| t = -2.38280 | |
| p = 0.01728 | |
| NeighborhoodOldTown | -17,286.82000** (-34,413.07000, -160.56650) |
| t = -1.97834 | |
| p = 0.04803 | |
| NeighborhoodSawyer | -15,924.98000* (-32,504.81000, 654.83900) |
| t = -1.88255 | |
| p = 0.05991 | |
| NeighborhoodSawyerW | -19,407.73000** (-35,589.62000, -3,225.83900) |
| t = -2.35068 | |
| p = 0.01884 | |
| NeighborhoodSomerst | 8,604.93300 (-6,578.63100, 23,788.50000) |
| t = 1.11076 | |
| p = 0.26681 | |
| NeighborhoodStoneBr | 53,257.01000*** (36,275.32000, 70,238.71000) |
| t = 6.14673 | |
| p = 0.00000 | |
| NeighborhoodSWISU | -16,399.20000* (-35,528.48000, 2,730.08000) |
| t = -1.68024 | |
| p = 0.09307 | |
| NeighborhoodTimber | -12.97331 (-17,017.55000, 16,991.60000) |
| t = -0.00150 | |
| p = 0.99881 | |
| NeighborhoodVeenker | 3,409.32900 (-16,679.05000, 23,497.71000) |
| t = 0.33264 | |
| p = 0.73945 | |
| Bldg.Type2fmCon | -5,963.46800 (-16,294.87000, 4,367.93800) |
| t = -1.13133 | |
| p = 0.25806 | |
| Bldg.TypeDuplex | -2,540.91000 (-15,085.22000, 10,003.40000) |
| t = -0.39700 | |
| p = 0.69142 | |
| Bldg.TypeTwnhs | -45,105.03000*** (-56,086.96000, -34,123.09000) |
| t = -8.04997 | |
| p = 0.00000 | |
| Bldg.TypeTwnhsE | -32,507.15000*** (-39,552.44000, -25,461.85000) |
| t = -9.04331 | |
| p = 0.00000 | |
| House.Style1.5Unf | 10,473.11000 (-6,311.75700, 27,257.97000) |
| t = 1.22294 | |
| p = 0.22150 | |
| House.Style1Story | 11,722.86000*** (4,828.52700, 18,617.19000) |
| t = 3.33265 | |
| p = 0.00088 | |
| House.Style2.5Fin | -27,978.77000** (-55,135.97000, -821.58150) |
| t = -2.01926 | |
| p = 0.04360 | |
| House.Style2.5Unf | -5,404.63300 (-21,315.42000, 10,506.16000) |
| t = -0.66577 | |
| p = 0.50564 | |
| House.Style2Story | -7,450.43500** (-13,608.96000, -1,291.90800) |
| t = -2.37112 | |
| p = 0.01783 | |
| House.StyleSFoyer | 13,046.92000** (2,510.04300, 23,583.79000) |
| t = 2.42686 | |
| p = 0.01532 | |
| House.StyleSLvl | 3,711.62500 (-4,850.44600, 12,273.70000) |
| t = 0.84964 | |
| p = 0.39564 | |
| Overall.Qual | 13,476.23000*** (11,688.02000, 15,264.44000) |
| t = 14.77061 | |
| p = 0.00000 | |
| Overall.Cond | 5,166.65000*** (3,701.26600, 6,632.03400) |
| t = 6.91044 | |
| p = 0.00000 | |
| Year.Built | 373.42820*** (248.86100, 497.99550) |
| t = 5.87559 | |
| p = 0.00000 | |
| Year.Remod.Add | 97.81441* (-2.31705, 197.94590) |
| t = 1.91461 | |
| p = 0.05569 | |
| FoundationCBlock | -2,616.14400 (-8,575.75300, 3,343.46500) |
| t = -0.86038 | |
| p = 0.38969 | |
| FoundationPConc | 4,500.01300 (-2,149.97000, 11,150.00000) |
| t = 1.32630 | |
| p = 0.18490 | |
| FoundationSlab | 1,620.10900 (-12,437.13000, 15,677.35000) |
| t = 0.22589 | |
| p = 0.82132 | |
| FoundationStone | 8,033.56600 (-12,084.04000, 28,151.17000) |
| t = 0.78267 | |
| p = 0.43392 | |
| FoundationWood | -15,525.36000 (-45,800.65000, 14,749.93000) |
| t = -1.00508 | |
| p = 0.31499 | |
| Bsmt.Unf.SF | 5.37722 (-1.28092, 12.03537) |
| t = 1.58290 | |
| p = 0.11361 | |
| BaseLivArea | 25.56636*** (19.11498, 32.01775) |
| t = 7.76719 | |
| p = 0.00000 | |
| X2nd.Flr.SF | 12.75196** (2.11889, 23.38504) |
| t = 2.35053 | |
| p = 0.01885 | |
| Gr.Liv.Area | 41.53372*** (33.28478, 49.78267) |
| t = 9.86849 | |
| p = 0.00000 | |
| Full.Bath | 6,028.74500*** (1,999.69500, 10,057.79000) |
| t = 2.93273 | |
| p = 0.00340 | |
| Half.Bath | 2,826.39400 (-1,090.75800, 6,743.54700) |
| t = 1.41420 | |
| p = 0.15747 | |
| Bedroom.AbvGr | -3,959.60400*** (-6,466.27700, -1,452.93000) |
| t = -3.09601 | |
| p = 0.00199 | |
| Kitchen.AbvGr | -13,962.13000** (-25,470.64000, -2,453.61500) |
| t = -2.37783 | |
| p = 0.01751 | |
| TotRms.AbvGrd | 2,363.35600*** (586.58330, 4,140.12900) |
| t = 2.60703 | |
| p = 0.00921 | |
| Fireplaces | 5,213.14800*** (2,644.78400, 7,781.51300) |
| t = 3.97825 | |
| p = 0.00008 | |
| Garage.Area | 21.43800*** (12.95246, 29.92354) |
| t = 4.95168 | |
| p = 0.000001 | |
| X3Ssn.Porch | 49.54326 (-10.19677, 109.28330) |
| t = 1.62543 | |
| p = 0.10424 | |
| Screen.Porch | 59.53482*** (36.20862, 82.86101) |
| t = 5.00236 | |
| p = 0.000001 | |
| Sale.ConditionAdjLand | 3,777.04400 (-18,825.20000, 26,379.29000) |
| t = 0.32753 | |
| p = 0.74331 | |
| Sale.ConditionAlloca | 1,632.42900 (-13,542.48000, 16,807.34000) |
| t = 0.21084 | |
| p = 0.83304 | |
| Sale.ConditionFamily | 6,853.62800 (-5,018.08700, 18,725.34000) |
| t = 1.13150 | |
| p = 0.25799 | |
| Sale.ConditionNormal | 6,389.47000** (936.73780, 11,842.20000) |
| t = 2.29667 | |
| p = 0.02175 | |
| Sale.ConditionPartial | 23,977.05000*** (16,433.02000, 31,521.08000) |
| t = 6.22932 | |
| p = 0.00000 | |
| Constant | -961,415.00000*** (-1,232,806.00000, -690,023.50000) |
| t = -6.94325 | |
| p = 0.00000 | |
| Observations | 2,051 |
| R2 | 0.86705 |
| Adjusted R2 | 0.86193 |
| Residual Std. Error | 29,329.89000 (df = 1974) |
| F Statistic | 169.38410*** (df = 76; 1974) |
| Note: | p<0.1; p<0.05; p<0.01 |
6.3.3 Best Subset Selection:
- Procedure: Considers all possible combinations of predictors and fits a model for each combination.
- Selection Criteria: The “best” model is selected based on a criterion like RSS, Akaike information criterion (AIC), Bayesian information criterion (BIC), or adjusted R-squared.
- Advantages: Can find the optimal model based on the selected criterion.
- Disadvantages: Computationally intensive, especially as the number of predictors grows. For example, with p predictors, there are 2^p possible models to evaluate. For this reason, it’s often not feasible for datasets with many predictors. For this dataset, I am keeping the maximum number of variables in the model to be 34, seven variables less than the entire dataset. Adjusting this number in the code below will make selecting max variables in the model easy.
# Load necessary libraries
library(leaps)
library(tidyverse)
# Perform best subset selection
# best_subsets <- regsubsets(SalePrice ~ Order + Lot.Frontage + Lot.Area + Lot.Shape + Utilities + Lot.Config + Land.Slope + Neighborhood + Bldg.Type + House.Style +
# Overall.Qual + Overall.Cond + Year.Built + Year.Remod.Add + Foundation + Bsmt.Unf.SF + Total.Bsmt.SF + BaseLivArea + Central.Air + X1st.Flr.SF + X2nd.Flr.SF + Gr.Liv.Area + Full.Bath +
# Half.Bath + Bathrooms + Bedroom.AbvGr + Kitchen.AbvGr +
# TotRms.AbvGrd + Fireplaces + Garage.Type + Garage.Area + Wood.Deck.SF + Open.Porch.SF + Enclosed.Porch + X3Ssn.Porch + Screen.Porch + Mo.Sold + Yr.Sold + Sale.Condition,
# data=train_data, nvmax=ncol(train_data)-7,really.big=T)
# # Extracting the summary
# best_subsets_summary <- summary(best_subsets)
# # Displaying the best model for each number of predictors based on RSS (or adjust for other criteria)
# data.frame(
# predictors = 1:(ncol(train_set)-1),
# adjr2 = best_subsets_summary$adjr2,
# rss = best_subsets_summary$rss,
# cp = best_subsets_summary$cp,
# bic = best_subsets_summary$bic
# ) %>%
# arrange(desc(adjr2)) %>%
# head(1)
# # Identify the best model size based on highest adjusted R-squared
# best_model_size <- which.min(best_subsets_summary$bic)
# # Extract the variables included in the best model
# best_model_vars <- names(which(summary_results$which[best_model_size, ]))
# # Construct the formula
# best_formula <- as.formula(paste("y ~", paste(best_model_vars[-1], collapse = " + ")))
# print(best_formula)In practice, the choice of method depends on: - The number of predictors (with many predictors, the best subset becomes computationally infeasible). - The primary goal (prediction accuracy vs. interpretability). - Computational resources.
While best subset selection can find the most optimal model, forward and backward selection are often preferred due to their computational efficiency and the ability to produce more straightforward, more interpretable models.
We need to consider a selection criterion to compare the best models from forward, backward, and best subset selection methods. The AIC (Akaike Information Criterion) is a commonly used criterion that penalizes the addition of unnecessary predictors to a model. A lower AIC suggests a better-fitting model. Another criterion you could use is the adjusted R^2, which adjusts R^2 based on the number of predictors in the model. A higher adjusted R^2 suggests a better-fitting model.
Let’s Extract the best models from all the procedures.
library(MASS)
library(leaps)
# Forward Selection
forward_aic <- AIC(forward_selected_model)
# Backward Selection
backward_aic <- AIC(backward_selected_model)
# Best Subset Selection
# best_model_size <- which.min(best_subset_summary$cp) # Replace cp with bic or adjr2 as needed
# best_formula <- as.formula(best_subsets$call[[2]][[2]][[best_model_size]])
# best_subset_selected_model <- lm(best_formula, data=train_data)
# best_subset_aic <- AIC(best_subset_selected_model)
# Comparing AIC values to identify the best model
# aic_values <- c(Forward=forward_aic, Backward=backward_aic, BestSubset=best_subset_aic)
aic_values <- c(Forward=forward_aic, Backward=backward_aic)
best_method <- names(which.min(aic_values))The best model selection method based on AIC is Forward
7 Prediction Over Test
# Predictions using the backward_selected_model
predicted_values_forward <- predict(forward_selected_model, newdata = test_data)
# Compare predicted values to actual SalePrice values in the test set
actual_values_test <- test_data$SalePrice
comparison <- data.frame(Actual = actual_values_test, Predicted = predicted_values_forward)
# Calculate Mean Squared Error
mse_forward <- mean((predicted_values_forward - actual_values_test)^2)
rmse <- sqrt(mse_forward)
# Display the MSE and first few rows of the comparison
head(comparison) %>% kbl(fullwidth=F)| Actual | Predicted | |
|---|---|---|
| 3 | 159000 | 182796.9 |
| 6 | 190000 | 194751.0 |
| 15 | 201800 | 216285.2 |
| 19 | 100000 | 123190.2 |
| 22 | 235876 | 247688.0 |
| 26 | 172000 | 159412.5 |
The Root Mean Squared Error for the Forward Selection Model is 29398.9693.
7.1 Visualizations
library(ggplot2)
# Plotting actual vs. predicted values
p <- ggplot(comparison, aes(x=Actual, y=Predicted)) +
geom_point(aes(color = abs(Actual - Predicted)), size = 3, alpha = 0.7) + # Color points by residual magnitude
geom_abline(intercept=0, slope=1, linetype="dashed", color="red") + # Line of perfect prediction
scale_color_viridis_c(option = "E", direction = -1) +
labs(
title = "Actual vs Predicted SalePrice",
x = "Actual SalePrice",
y = "Predicted SalePrice",
color = "Residual Magnitude"
) +
theme_minimal()
p# full_lm_model <- lm(cnt ~ .-causal-registerd, data = train_data)8 Equations
8.1 Causal
\begin{align} ln(y) &= \beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2} \implies y= e^{\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}}\\ \sqrt{y} &=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}\implies y = (\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2})^2== b_0^2+b_1^2+2*b1b2\\ \sqrt[3]{y} &=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}\\ cnt &= 161.807 +85.5765 *temp + 314.3430 *atemp - 275.1803 *hum+ 43.000* windspeed\\ log(cnt) &= 1.9205 + 0.043330139 *temp + 1.287265532 *atemp - 0.979846625 *hum+ * windspeed\\ then\; my\; Y\; is \end{align}
Intercept 1.92058028 temp 0.043330139 atemp 1.287265532 hum -0.979846625 windspeed 0.246232091